Help Twitter Combat Hate Speech Using NLP and Machine Learning.

1. Load the tweets file using the read_csv function from the Pandas package.

Label 0 denotes a non-hate (positive) tweet and 1 denotes a hate (negative) tweet. There is a class imbalance problem here: roughly 93% of the tweets are non-hate and only 7% are hate.
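A minimal sketch of the loading and imbalance check. The column names (`id`, `label`, `tweet`) and the inline sample are stand-ins for the real CSV file:

```python
import io
import pandas as pd

# A tiny inline stand-in for the real tweets CSV (hypothetical columns).
csv_data = io.StringIO(
    "id,label,tweet\n"
    "1,0,@user happy day!\n"
    "2,1,@user some hateful text\n"
    "3,0,nice weather today\n"
)
tweets_df = pd.read_csv(csv_data)

# Label distribution: 0 = non-hate, 1 = hate. On the real data this
# shows roughly 93% vs. 7%, i.e. a strong class imbalance.
print(tweets_df["label"].value_counts(normalize=True))
```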

2. Get the tweets into a list for easy text cleanup and manipulation.

3. To clean up:

1. Normalize the casing.

2. Using regular expressions, remove user handles. These begin with '@'.

3. Using regular expressions, remove URLs.

4. Using TweetTokenizer from NLTK, tokenize the tweets into individual terms.

5. Remove stop words.

6. Remove redundant terms like 'amp', 'rt', etc.

The match returns no results, which means 'amp' (the remnant of the HTML-escaped '&') has been removed.

7. Remove '#' symbols from the tweets while retaining the term itself.

The match returns no results, which means '#' has been removed.
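The cleanup steps above can be sketched as one function. The regex patterns are one reasonable choice, not the project's exact ones, and the real project draws its stop words from nltk.corpus.stopwords; a tiny inline set keeps this example self-contained (swap in `set(stopwords.words("english"))` in practice):

```python
import re
from nltk.tokenize import TweetTokenizer

# Inline stand-in for nltk.corpus.stopwords to avoid a data download.
stop_words = {"a", "an", "the", "is", "are", "to", "of", "and", "in"}
redundant = {"amp", "rt"}
tokenizer = TweetTokenizer()

def clean_tweet(tweet):
    tweet = tweet.lower()                            # 1. normalize casing
    tweet = re.sub(r"@\w+", "", tweet)               # 2. drop user handles
    tweet = re.sub(r"http\S+|www\.\S+", "", tweet)   # 3. drop URLs
    tokens = tokenizer.tokenize(tweet)               # 4. tokenize
    tokens = [t for t in tokens if t not in stop_words]  # 5. stop words
    tokens = [t for t in tokens if t not in redundant]   # 6. 'amp', 'rt'
    return [t.replace("#", "") for t in tokens]          # 7. strip '#'

print(clean_tweet("RT @user the amp is in https://t.co/x #HappyDay"))
# → ['happyday']
```

TweetTokenizer keeps hashtags like "#happyday" as single tokens, so stripping the '#' afterwards retains the term itself.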

4. Extra cleanup: remove terms with a length of 1.

5. Check out the top terms in the tweets:

1. First, get all the tokenized terms into one large list.

2. Use Counter to find the 10 most common terms.
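These two sub-steps are a flatten-and-count; a sketch with a toy token list in place of the real cleaned tweets:

```python
from collections import Counter

# Toy stand-in for the cleaned token lists produced by the steps above.
tokenized_tweets = [["love", "life"], ["love", "sun"], ["sun", "love"]]

# Flatten every tweet's tokens into one large list, then count.
all_terms = [term for tweet in tokenized_tweets for term in tweet]
top_terms = Counter(all_terms).most_common(10)
print(top_terms)  # → [('love', 3), ('sun', 2), ('life', 1)]
```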

6. Data formatting for predictive modeling:

1. Join the tokens back to form strings. This will be required for the vectorizers.

2. Assign X and y.

3. Perform train_test_split using sklearn.
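A sketch of the formatting and split, with toy strings in place of the re-joined tweets. Using `stratify=y` is an assumption worth making here, since it preserves the class ratio in both splits:

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins: in the project, X holds the re-joined tweet strings
# and y holds the 0/1 labels from the DataFrame.
X = ["good day", "bad tweet", "nice one", "awful stuff"] * 10
y = [0, 1, 0, 1] * 10

# stratify=y keeps the 0/1 ratio identical in train and test,
# which matters given the class imbalance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(len(X_train), len(X_test))  # → 30 10
```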

7. We'll use the TF-IDF values of the terms as features, giving us a vector space model.

1. Import the TF-IDF vectorizer from sklearn.

2. Instantiate with a maximum of 5000 terms in your vocabulary.

3. Fit and apply on the train set.

4. Apply on the test set.
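The four sub-steps in a minimal sketch; the small corpus stands in for the cleaned train and test tweets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Small corpus standing in for the cleaned train/test tweets.
X_train = ["good happy day", "hateful awful tweet", "nice sunny day"]
X_test = ["happy sunny tweet"]

vectorizer = TfidfVectorizer(max_features=5000)    # cap vocabulary at 5000
X_train_tfidf = vectorizer.fit_transform(X_train)  # fit + transform train
X_test_tfidf = vectorizer.transform(X_test)        # transform test only
print(X_train_tfidf.shape, X_test_tfidf.shape)
```

Fitting only on the train set and merely transforming the test set avoids leaking test-set vocabulary into the model.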

8. Model building: Ordinary Logistic Regression

1. Instantiate Logistic Regression from sklearn with default parameters.

2. Fit on the train data.

3. Make predictions for the train and the test set.
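A sketch of the model-building step on a tiny labeled corpus (a stand-in for the real TF-IDF features and labels):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny labeled corpus in place of the real train set.
X_train = ["happy good day", "hate hate awful", "nice happy sun", "awful hate day"]
y_train = [0, 1, 0, 1]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(X_train)

clf = LogisticRegression()   # default parameters, as in this step
clf.fit(X_tr, y_train)
train_pred = clf.predict(X_tr)
print(train_pred)
```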

9. Model evaluation: accuracy, recall, and F1 score.

1. Report the accuracy on the test set.

2. Report the recall and F1 score.

The F1 score is 0.98 for class 0 (majority) and 0.44 for class 1 (minority).
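A sketch of how these metrics are computed with sklearn; the true and predicted labels here are hypothetical, chosen only to illustrate the calls:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             f1_score, recall_score)

# Hypothetical true vs. predicted test labels, for illustration only.
y_test = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

print("accuracy:", accuracy_score(y_test, y_pred))        # 4 of 6 correct
print("recall (class 1):", recall_score(y_test, y_pred))  # 1 of 2 found
print("F1 (class 1):", f1_score(y_test, y_pred))
print(classification_report(y_test, y_pred))  # per-class precision/recall/F1
```

classification_report is the convenient way to see the per-class F1 gap that reveals the imbalance problem.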

10. It looks like we need to adjust for the class imbalance, as the model seems to focus on the 0s.

1. Adjust the weight of the appropriate class in the LogisticRegression model.

Our majority class (0) accounts for 92.9854% of the tweets and the minority class (1) for 7.0146%. To correct the class imbalance, we set a higher weight for the minority class and reduce the weight for the majority class. Here we set the weights so that the minority class counts roughly 13 times as much as the majority class (0.93 / 0.07 ≈ 13).
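A sketch of the weighted model; the exact weight dict {0: 1, 1: 13} is one reasonable reading of the 13x ratio above, and the random features stand in for the real TF-IDF matrix:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Weight the minority class ~13x the majority (0.929854 / 0.070146 ≈ 13).
clf = LogisticRegression(class_weight={0: 1, 1: 13})

# Toy imbalanced data, just to show the model trains with these weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 93 + [1] * 7)
clf.fit(X, y)
print(clf.class_weight)  # → {0: 1, 1: 13}
```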

11. Train again with the adjustment and evaluate.

The F1 score is 0.95 for class 0 (majority) and 0.50 for class 1 (minority).

12. Use a balanced class weight while instantiating the logistic regression.

The F1 score is 0.96 for class 0 (majority) and 0.54 for class 1 (minority).
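Rather than a hand-picked dict, class_weight="balanced" derives the weights from the class frequencies. A sketch of what it computes, using the 93/7 split from this dataset:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# "balanced" gives each class the weight n_samples / (n_classes * count).
y = np.array([0] * 93 + [1] * 7)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(weights)  # → roughly [0.538, 7.143]
```

Passing class_weight="balanced" to LogisticRegression applies exactly these weights internally.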

13. Regularization and hyperparameter tuning:

1. Import GridSearchCV and StratifiedKFold; stratification matters because of the class imbalance.

Using GridSearchCV with stratified k-fold CV, we found the optimal weights that give the highest F1 score: 0.880603 for class 1 (the minority class) and 1 − 0.880603 = 0.119397 for class 0 (the majority class). Let's apply these weights and run the model once more.
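A sketch of the weight search. The synthetic data, the weight grid, and the 3-fold split are illustrative assumptions; the real search runs over the TF-IDF features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the TF-IDF features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = np.array([0] * 186 + [1] * 14)
X[y == 1] += 2.0  # make the minority class partly separable

# Search over minority-class weights, scoring by F1 on class 1;
# StratifiedKFold keeps the class ratio in every fold.
param_grid = {"class_weight": [{0: 1 - w, 1: w}
                               for w in np.linspace(0.05, 0.95, 10)]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(LogisticRegression(), param_grid, scoring="f1", cv=cv)
search.fit(X, y)
print(search.best_params_)
```

search.best_params_ holds the winning weight dict, which is then passed straight into the final LogisticRegression.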

The F1 score is 0.97 for class 0 (majority) and 0.61 for class 1 (minority). With grid search and stratified k-fold, we struck a balance, getting decent F1 scores for both class 0 and class 1.

We can see that the model has indeed identified the racist and hate tweets correctly.

Conclusion

In this project, we worked on classifying tweets as hate or non-hate. We cleaned the text by removing '&', '#', stop words, non-alphabetic characters, etc., and converted the tweets to tokens. We then did feature engineering by converting the words to vectors using TF-IDF. Next we assigned X and y and split the data. We trained a logistic regression model and dealt with the class imbalance by adjusting the weights of the majority and minority classes, including trying balanced weights to improve the F1 score. Lastly, we used GridSearchCV with stratified k-fold to find the optimal weights for the 0 and 1 classes. We were able to build a decent model with an F1 score of 0.97 for the majority class and 0.61 for the minority class.